05_analysis_1.qmd
Library Load
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data Load
data <- read_tsv("../data/03_dat_aug.tsv")
Rows: 4550676 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): Smoking, gene, is_significant
dbl (5): Metastasis, gene_expression, p_value, log2_fold_change_avg, log2_fo...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 4,550,676 × 8
Metastasis Smoking gene gene_expression p_value is_significant
<dbl> <chr> <chr> <dbl> <dbl> <chr>
1 0 Former 1007_s_at 6.80 0.562 No
2 0 Former 1053_at 8.39 0.339 No
3 0 Former 117_at 3.89 0.905 No
4 0 Former 121_at 5.43 0.809 No
5 0 Former 1255_g_at 2.23 0.578 No
6 0 Former 1294_at 2.75 0.561 No
7 0 Former 1316_at 2.69 0.163 No
8 0 Former 1320_at 2.23 0.364 No
9 0 Former 1405_i_at 6.71 0.903 No
10 0 Former 1431_at 2.23 0.215 No
# ℹ 4,550,666 more rows
# ℹ 2 more variables: log2_fold_change_avg <dbl>, log2_fold_change_sample <dbl>
Analysis 1
In the first analysis, we want to identify which genes are significantly differentially expressed in patients with metastatic cancer compared to patients with non-metastatic cancer. Furthermore, we want to investigate whether the expression of these genes is up-regulated or down-regulated.
Significance was calculated with a Student’s t-test in which the expression of each gene was compared between patients with and without metastasis.
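As an illustration of the per-gene test described above, here is a minimal sketch on toy data. The values and the names `toy` and `p_values` are hypothetical, and `var.equal = TRUE` is an assumption that matches the classical Student's t-test (R's default for `t.test()` is the Welch variant); the upstream script that produced `p_value` may differ in detail.

```r
library(dplyr)

# Toy data (hypothetical values) in the same long format as the real table:
# one row per sample and gene.
toy <- tibble(
  gene = rep(c("gene_a", "gene_b"), each = 6),
  Metastasis = rep(c(0, 0, 0, 1, 1, 1), times = 2),
  gene_expression = c(2.1, 2.3, 2.2, 5.0, 5.2, 5.1,  # gene_a: groups differ
                      3.0, 3.4, 3.2, 3.1, 3.3, 3.2)  # gene_b: same group means
)

# One two-sample t-test per gene: metastasis vs. no metastasis.
# var.equal = TRUE requests the pooled-variance (Student's) test (assumption).
p_values <- toy |>
  group_by(gene) |>
  summarise(
    p_value = t.test(gene_expression[Metastasis == 1],
                     gene_expression[Metastasis == 0],
                     var.equal = TRUE)$p.value,
    .groups = "drop"
  )

print(p_values)
```

With this toy input, `gene_a` gets a very small p-value and `gene_b` a p-value close to 1, which mirrors how the `is_significant` column separates genes in the real data.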
The log2 fold change for each gene was calculated from the average gene expression level in samples with metastasis compared to samples without metastasis.
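One possible reading of this calculation is sketched below on toy data: take the mean expression per gene in each group and then the log2 of their ratio. The toy values and the names `toy`, `mean_met`, `mean_non_met` and `fold_changes` are hypothetical. The sketch also assumes the expression values are on a linear scale; if they are already log2-transformed, the difference of the group means would be the appropriate quantity instead.

```r
library(dplyr)

# Toy data (hypothetical values): two genes, two samples per group.
toy <- tibble(
  gene = rep(c("gene_a", "gene_b"), each = 4),
  Metastasis = rep(c(0, 0, 1, 1), times = 2),
  gene_expression = c(2, 2, 8, 8,  # gene_a: 4x higher with metastasis
                      6, 6, 3, 3)  # gene_b: 2x lower with metastasis
)

fold_changes <- toy |>
  group_by(gene) |>
  summarise(
    mean_met = mean(gene_expression[Metastasis == 1]),
    mean_non_met = mean(gene_expression[Metastasis == 0]),
    .groups = "drop"
  ) |>
  # Positive values: higher with metastasis; negative values: lower.
  mutate(log2_fold_change_avg = log2(mean_met / mean_non_met))

print(fold_changes)
```

Here `gene_a` has a log2 fold change of 2 (up-regulated) and `gene_b` a log2 fold change of -1 (down-regulated).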
Conclusion:
These results are shown in a volcano plot where -log10(p-value) is shown on the y-axis and the log2 fold change is shown on the x-axis. Each dot represents a gene. From this, we can observe that 286 out of 48,932 genes are significantly differentially expressed in patients with metastasis at a significance level of 0.01. We can also observe that the significant genes tend to show larger fold changes in either direction, that is, stronger up-regulation or down-regulation, than the non-significant genes.
volcano_plot <- data |>
  # Keep one row per gene (gene-level statistics repeat across sample rows)
  select(gene, log2_fold_change_avg, p_value) |>
  unique() |>
  mutate(log_10_p = -log10(p_value),
         Significance = case_when(p_value > 0.01 ~ "Not significant",
                                  p_value <= 0.01 ~ "Significant")) |>
  ggplot(mapping = aes(x = log2_fold_change_avg,
                       y = log_10_p,
                       color = Significance)) +
  geom_point(size = 1, alpha = 0.5) +
  # Dotted line at -log10(0.01) = 2 marks the significance threshold;
  # `linewidth` replaces the `size` argument deprecated for lines in ggplot2 3.4.0
  geom_hline(yintercept = 2,
             linetype = "dotted",
             color = "black",
             linewidth = 0.5) +
  # Apply the complete theme first so it does not reset legend.position
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Genes Associated with Metastasis in Bladder Cancer",
       subtitle = "Genes highlighted in turquoise are significant on a significance level of 0.01",
       x = "Log2 Fold Change",
       y = "-log10(p)")
ggsave(
  filename = "../results/05_volcano_plot.png",
  plot = volcano_plot,
  device = "png",
  height = 5,
  dpi = 300,
  bg = "white"
)
Saving 7 x 5 in image
print(volcano_plot)
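The number of significant genes quoted in the conclusion could be checked with a short tally like the one below. This is a sketch on toy data: the table `toy`, the object `summary_counts` and the `status` labels are hypothetical, and only the column names and the 0.01 threshold are taken from the analysis above.

```r
library(dplyr)

# Toy data (hypothetical values): gene-level statistics repeated per sample,
# as in the real table.
toy <- tibble(
  gene = c("gene_a", "gene_a", "gene_b", "gene_b", "gene_c", "gene_c"),
  log2_fold_change_avg = c(1.5, 1.5, -0.8, -0.8, 0.1, 0.1),
  p_value = c(0.001, 0.001, 0.005, 0.005, 0.4, 0.4)
)

summary_counts <- toy |>
  # Collapse to one row per gene before counting
  distinct(gene, log2_fold_change_avg, p_value) |>
  mutate(
    status = case_when(
      p_value > 0.01 ~ "Not significant",
      log2_fold_change_avg > 0 ~ "Up-regulated",
      TRUE ~ "Down-regulated"
    )
  ) |>
  count(status)

print(summary_counts)
```

Applied to the real `data` object instead of `toy`, the same pipeline would give the split of significant genes into up- and down-regulated groups alongside the non-significant remainder.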